Lecture 04
Lecture 3: Review
- Introduction to histograms or frequency distributions
- Probability Distribution Functions (PDF)
- Descriptive Statistics
Center - mean, median, mode
Spread - range, variance, standard deviation
Our last graphs
Lecture 4: Lecture Overview
The objectives:
- Introduction to hypothesis testing
- The standard normal distribution
- Standard error
- Confidence intervals
- Student’s t-distribution
- H testing sequence
- p-values
Lecture 4: Standard normal distribution
To understand hypothesis testing, we first need to understand the standard normal distribution
Recall - sculpin in Toolik Lake
n = 208
mean = 51.69 mm
std dev = s = 12.02 mm
Length distribution ~normal
Lecture 4: Standard normal distribution
You want to know things about this population like
- probability of a fish from the lake having a certain length (e.g., > 60 mm)
- Can solve this by integrating under curve
- But it is tedious to do every time
- Instead, we can use the standard normal distribution (SND)
Lecture 4: Standard normal distribution
Standard Normal Distribution
- “benchmark” normal distribution with µ = 0, σ = 1
- The Standard Normal Distribution is defined so that:
~68% of the curve area within +/- 1 σ of the mean,
~95% within +/- 2 σ of the mean,
~99.7% within +/- 3 σ of the mean
*remember σ = standard deviation
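The three percentages above can be checked numerically. This is a minimal sketch using `NormalDist` from Python's standard library; the helper name `area_within` is introduced here for illustration only.

```python
from statistics import NormalDist

snd = NormalDist(mu=0, sigma=1)  # standard normal: mean 0, SD 1

def area_within(k):
    """Area under the SND between -k and +k standard deviations."""
    return snd.cdf(k) - snd.cdf(-k)

for k in (1, 2, 3):
    print(f"within +/- {k} sigma: {area_within(k):.4f}")
```

The printed areas should round to roughly 68%, 95% and 99.7%.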
Lecture 4: Standard normal distribution
Areas under curve of Standard Normal Distribution
- Have been calculated for a range of z-values
- Can be looked up in z-table
- No need to integrate
- Any normally distributed data can be standardized
- transformed into the standard normal distribution
- looked up in a table
Lecture 4: Standard normal distribution
Done by converting original data points to z-scores
- Z-scores calculated as:
\(z = \frac{x_i - \mu}{\sigma}\)
- z = z-score for the observation
- \(x_i\) = original observation
- µ = mean of data distribution
- σ = SD of data distribution
Lecture 4: Standard normal distribution
Thus:
- z-score = (value - mean) / s
- z-score of 25mm = (25 - 51.7) / 12 = -2.225
- z-score of 51.7mm = (51.7 - 51.7) / 12 = 0
- z-score of 60mm = (60 - 51.7) / 12 = 0.6916667
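The three z-scores above can be reproduced with a one-line function. This is a small sketch; the function name `z_score` is chosen here for illustration, and the inputs are the rounded sculpin values used on this slide (mean = 51.7 mm, s = 12 mm).

```python
def z_score(x, mean, sd):
    """Standardize one observation: (x - mean) / sd."""
    return (x - mean) / sd

# Rounded sculpin values used in the lecture: mean = 51.7 mm, s = 12 mm
for length_mm in (25, 51.7, 60):
    print(length_mm, round(z_score(length_mm, 51.7, 12), 4))
```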
Lecture 4: Standard normal distribution
Area under curve (probability) of standard normal distribution is known relative to z-values
Knowing z-value, can figure out corresponding area under the curve
What is the area under the curve to the left of z = 0?
Lecture 4: Standard normal distribution
Here is the z-score table for the right side (positive values) of the z distribution (z > 0)
Numbers give the area under the curve to the left of a particular z-score
For example, 60 mm has a z-score of 0.6916667
Lecture 4: Standard normal distribution
What is the area under the curve to the left of z = 0?
- 0.5 of the area of the curve is contained to the left of z = 0.00
Lecture 4: Standard normal distribution
What area of the curve is contained to the left of z = 1.22?
- 0.8888 or 88.9%
Lecture 4: Standard normal distribution
What area of the curve is contained between z = 0 and z = 1.5?
- approximately 0.4332 (or 43.32%)
To find the area under the standard normal curve between 0 and 1.5 using a standard normal table:
- Locate z = 1.5 in the table: 0.9332
- This represents P(Z ≤ 1.5), the probability that Z is less than or equal to 1.5
- Since we need the area between 0 and 1.5, subtract P(Z ≤ 0) from P(Z ≤ 1.5)
- From table - P(Z ≤ 0) = 0.5000.
- Therefore, the area between 0 and 1.5 is: 0.9332 - 0.5000 = 0.4332.
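The table lookup and subtraction can be mirrored in code. This sketch uses the cumulative distribution function of `statistics.NormalDist` in place of the printed z-table; the variable names are illustrative.

```python
from statistics import NormalDist

snd = NormalDist()  # defaults to mean 0, SD 1

p_below_1_5 = snd.cdf(1.5)   # P(Z <= 1.5), the table entry for z = 1.5
p_below_0 = snd.cdf(0.0)     # P(Z <= 0)
area_0_to_1_5 = p_below_1_5 - p_below_0

print(round(p_below_1_5, 4), round(p_below_0, 4), round(area_0_to_1_5, 4))
```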
Lecture 4: Standard normal distribution
What area of the curve is contained to the left of z = -1?
- Locate row for 1.0 - (table shows absolute value of z) and the column for .00
- value = 0.8413 - represents P(Z ≤ 1.0)
- However, we want P(Z ≤ -1.0)
so we use the symmetry property of the standard normal distribution:
P(Z ≤ -1.0) = 1 - P(Z ≤ 1.0) = 1 - 0.8413 = 0.1587
Therefore, 15.87% of area falls to the left of z = -1.0
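The symmetry step can be checked against a direct calculation, and the same idea gives the upper-tail probability for the sculpin question (length > 60 mm). This is a sketch assuming the rounded lecture values (mean = 51.7 mm, s = 12 mm) and an approximately normal length distribution.

```python
from statistics import NormalDist

snd = NormalDist()

# Left tail via symmetry, as done with a positive-z table
p_left_of_minus_1 = 1 - snd.cdf(1.0)

# The same area computed directly from the CDF at z = -1
p_direct = snd.cdf(-1.0)

# Upper tail for the sculpin example: P(length > 60 mm), using the
# rounded lecture values mean = 51.7 mm and s = 12 mm
z_60 = (60 - 51.7) / 12
p_longer_than_60 = 1 - snd.cdf(z_60)

print(round(p_left_of_minus_1, 4), round(p_direct, 4), round(p_longer_than_60, 3))
```

The first two numbers should agree, which is the symmetry property in action.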
Lecture 4: Standard error
Take random samples from fish population:
3 random samples (each n=20) from population:
Notice the sample statistics and distributions
Lecture 4: Standard error
Every sample gives slightly different estimate of µ
- Can take many samples and calculate means
- plot the frequency distribution of means
- get the “sampling distribution of means”
Lecture 4: Standard error
3 important properties:
- The sampling distribution of means (SDM) from a normal population will be normal
- With large samples, the sampling distribution of means from any population will be approximately normal (Central Limit Theorem)
- The mean of the sampling distribution of means will equal µ, the population mean
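These properties can be illustrated with a small simulation. The sketch below is not the lecture's fish data: it assumes a deliberately skewed population (exponential with mean 1 and SD 1), a sample size of 30, and 5000 repeated samples, all chosen here for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def sample_mean(n):
    """Mean of n draws from a right-skewed population (exponential, mean 1)."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])

n = 30
means = [sample_mean(n) for _ in range(5000)]

mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)

# Share of sample means within 2 SDs of their mean (about 95% if near-normal)
share_within_2sd = sum(
    abs(m - mean_of_means) <= 2 * sd_of_means for m in means
) / len(means)

print(f"mean of sample means: {mean_of_means:.3f} (population mean = 1)")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma / sqrt(n) = {1 / n ** 0.5:.3f})")
print(f"share within 2 SD:    {share_within_2sd:.3f}")
```

Even though the population is strongly skewed, the sample means should centre on the population mean and behave roughly like a normal distribution.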
Lecture 4: Standard error
Given the above, we
can estimate the standard deviation of the sample means
“Standard error of the sample mean”
How good is your estimate of the population mean? (based on the sample collected)
quantifies how much sample means are expected to vary from sample to sample
gives an estimate of the error associated with using \(\bar{y}\) to estimate \(\mu\)
Lecture 4: Standard error
\(\sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}}\)
but we rarely know σ, so use s:
\(s_{\bar{y}} = \frac{s}{\sqrt{n}}\)
Where:
- \(s_{\bar{y}}\) = sample standard error of the mean
- s = sample standard deviation
- n = sample size
Lecture 4: Standard error
Notice:
- \(s_{\bar{y}}\) depends on
- sample s (standard deviation)
- sample n (sample size)
- \(s_{\bar{y}} = \frac{s}{\sqrt{n}}\)
How and why?
- Decreases with sample n (sample size)
- Increases with sample s (standard deviation)
- Large sample, low s = greater confidence in estimate of \(\mu\)
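The dependence on s and n is easy to see numerically. This is a minimal sketch; `standard_error` is a name chosen here, and the first call uses the sculpin sample from earlier in the lecture (s = 12.02 mm, n = 208).

```python
import math

def standard_error(s, n):
    """Sample standard error of the mean: s / sqrt(n)."""
    return s / math.sqrt(n)

# Sculpin sample from the lecture: s = 12.02 mm, n = 208
se_sculpin = standard_error(12.02, 208)
print(f"sculpin SE: {se_sculpin:.3f} mm")

# Same s, growing n: the standard error shrinks
for n in (20, 80, 320):
    print(n, round(standard_error(12.02, n), 3))
```

Because of the square root, quadrupling the sample size only halves the standard error.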
Lecture 4: Confidence intervals
Every sample gives slightly different estimate of µ (population mean)
Want to know how accurate our estimate of µ is from a sample
Do this by calculating confidence interval:
- Range of values that will contain the true population mean with a certain probability
Lecture 4: Confidence intervals
Often calculate 95% CIs
- Interpret 95% CI to mean:
- Range of values that contains µ (population mean) with 95% probability
- More correctly:
- If we took 100 samples from the population
- and calculated a CI from each
- about 95 of the 100 CIs would be expected to contain the true population mean - µ
Lecture 4: Confidence intervals
Formula for confidence interval
\(\text{95\% CI} = \bar{y} \pm z \cdot \frac{\sigma}{\sqrt{n}}\)
Where:
- ȳ is the sample mean
- 𝑛 is the sample size
- σ is the population standard deviation
- z is the z-value corresponding to the confidence level of the CI
Lecture 4: Confidence intervals
Formula for confidence interval
\(\text{95\% CI} = \bar{y} \pm z \cdot \frac{\sigma}{\sqrt{n}}\)
95% of the probability of the SND lies between z = -1.96 and z = 1.96
So for:
- 95% CI z = 1.960
- 90% CI z = 1.645
- 99% CI z = 2.576
- And so on….
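The z-values listed above come from the inverse of the standard normal CDF, so they can be generated for any confidence level. The sketch below assumes σ is known; the function names and the numbers in the final call (ȳ = 50, σ = 10, n = 100) are hypothetical and chosen only to exercise the formula.

```python
import math
from statistics import NormalDist

def z_for_confidence(level):
    """Two-tailed z-value for a confidence level such as 0.95."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

def z_confidence_interval(ybar, sigma, n, level=0.95):
    """CI for the mean when the population SD (sigma) is known."""
    margin = z_for_confidence(level) * sigma / math.sqrt(n)
    return ybar - margin, ybar + margin

for level in (0.90, 0.95, 0.99):
    print(level, round(z_for_confidence(level), 3))

# Hypothetical numbers, chosen only to exercise the function
low, high = z_confidence_interval(ybar=50.0, sigma=10.0, n=100)
print(round(low, 2), round(high, 2))
```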
Lecture 4: Confidence intervals
In the more typical case we DON’T know the population σ, so we estimate it from the sample s
When we don’t know the population σ (and especially when sample size is < ~30), we can’t use the standard normal (z) distribution
Instead, we use Student’s t distribution
Lecture 4: Student’s t-distribution
Student’s t distribution is similar to the SND
- changes depending on degrees of freedom (df = n-1)
- t distribution is more “conservative” (heavier tails, so wider intervals)
- the smaller n is, the more conservative the t distribution is
At df = ~30, the t distribution becomes close to the z distribution
Lecture 4: Student’s t-distribution
To calculate CI for sample from “unknown” population:
\(\text{CI} = \bar{y} \pm t \cdot \frac{s}{\sqrt{n}}\)
Where:
- ȳ is sample mean
- 𝑛 is sample size
- s is sample standard deviation
- t is the t-value corresponding to the confidence level of the CI
- t is found in the t-table for the appropriate degrees of freedom (n-1)
Lecture 4: Student’s t-distribution
Here is a t-table
- Values of t that correspond to probabilities
- Probabilities listed along top
- Sample dfs are listed in the left-most column
- Probabilities are given for one-tailed and two-tailed “questions”
Lecture 4: Student’s t-distribution
One-tailed questions: area of the distribution to the left (or right) of a certain value
- n = 20 (df = 19) - 90% of the observations are found to the left of
- t = 1.328 (10% are outside)
Lecture 4: Student’s t-distribution
Two-tailed questions refer to area between certain values
- n= 20 (df=19), 90% of the observations are between
- t=-1.729 and t=1.729 (10% are outside)
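The one-tailed and two-tailed table entries for df = 19 can be reproduced without a t-table. Python's standard library has no t distribution, so this sketch integrates the t density numerically (Simpson's rule) and finds the critical value by bisection. The function names, step count and search bounds are choices made here for illustration, not part of the lecture.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t): Simpson's rule on [0, |t|], then symmetry about 0."""
    if t == 0:
        return 0.5
    h = abs(t) / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area

def t_critical(p, df):
    """t-value with area p to its left, found by bisection."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

df = 19  # n = 20
one_tailed_90 = t_critical(0.90, df)   # 10% in one tail
two_tailed_90 = t_critical(0.95, df)   # 5% in each tail
print(round(one_tailed_90, 3), round(two_tailed_90, 3))
```

The two-tailed value is larger because the 10% outside the interval is split between both tails, leaving only 5% beyond each cut-off.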
Lecture 4: Student’s t-distribution
Let’s calculate CIs again:
Use the two-tailed t-value
- 95% CI Sample A: 51.7 ± 1.984 * (12/(208^0.5)) = 51.7 ± 1.650788
- The 95% CI is between 50.05 and 53.35
- “The 95% CI for the population mean from sample A is 51.7 ± 1.65”
Lecture 4: Student’s t-distribution
So:
- Can assess confidence that population mean is within a certain range
- Can use t distribution to ask questions like:
- “What is the probability of getting a sample with mean = ȳ from a population with mean = µ?” (1 sample t-test)
- “What is the probability that two samples came from same population?” (2 sample t-test)
Lecture 4: Next steps
For example
- what is the probability that population X is the same as our lake’s population?
How would you assess this question using what we learned?
Lecture 4: Next steps
Let’s calculate the 95% CI for population X
Use the two-tailed t-value
- 95% CI Sample X: 54 ± 1.984 * (10.9/(132^0.5)) = 54 ± 1.882267
- The 95% CI is between 52.12 and 55.88
Notice: the 95% confidence interval does not contain 51.7 (the mean of sample A)
- What does this tell us about population X?
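Both interval calculations can be repeated in a few lines, which also makes it easy to check where 51.7 falls relative to the interval for population X. This is a sketch: the function names are illustrative, and the t-value 1.984 is simply the one used in the lecture's calculations rather than a value computed for each sample's degrees of freedom.

```python
import math

def t_confidence_interval(ybar, s, n, t_value):
    """CI for the mean using a tabled t-value: ybar +/- t * s / sqrt(n)."""
    margin = t_value * s / math.sqrt(n)
    return ybar - margin, ybar + margin

T_TABLE = 1.984  # two-tailed 95% t-value used in the lecture's calculations

ci_a = t_confidence_interval(51.7, 12.0, 208, T_TABLE)   # sample A
ci_x = t_confidence_interval(54.0, 10.9, 132, T_TABLE)   # population X

def contains(ci, value):
    """True when value lies inside the interval (inclusive)."""
    return ci[0] <= value <= ci[1]

print("A:", [round(v, 2) for v in ci_a])
print("X:", [round(v, 2) for v in ci_x])
print("X interval contains 51.7:", contains(ci_x, 51.7))
```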
Lecture 4: Statistical hypothesis testing
Major goal of statistics:
- make inferences about populations from samples
- assign a degree of confidence to those inferences
Statistical H-testing:
formalized approach to inference
- hypotheses ask whether samples come from populations with certain properties
- often interested in questions about population means (but not only)
Lecture 4: Statistical hypothesis testing
Relies on specifying null hypothesis (Ho) and alternate hypothesis (Ha)
- Ho is the hypothesis of “no effect”
- (e.g., two samples come from populations with the same mean; a sample is from a population with mean = 0)
- Ha (research hypothesis) is the opposite of Ho
Lecture 4: Statistical hypothesis testing
- p = 0.3 means that if Ho is true and the study were repeated 100 times
- we would get this (or a more extreme) result due to chance about 30 times
- p = 0.03 means that if Ho is true and the study were repeated 100 times
- we would get this (or a more extreme) result due to chance about 3 times
Which p-value suggests Ho likely false?
Lecture 4: Statistical hypothesis testing
At what point reject Ho?
p < 0.05 conventional “significance threshold” (α)
p < 0.05 means:
- if Ho is true and the study were repeated 100 times
- we would get this (or a more extreme) result fewer than 5 times due to chance
Lecture 4: Statistical hypothesis testing
α is the rate at which we will reject a true null hypothesis (Type I error rate)
Lowering α will lower likelihood of incorrectly rejecting a true null hypothesis (e.g., 0.01, 0.001)
Both hypotheses and α are specified BEFORE collection of data and analysis
Lecture 4: Statistical hypothesis testing
Traditionally α=0.05 is used as a cut off for rejecting null hypothesis
Nothing magical about 0.05 - actual p-values need to be reported.
| p-value range | Interpretation |
|---|---|
| P > 0.10 | No evidence against Ho - data appear consistent with Ho |
| 0.05 < P < 0.10 | Weak evidence against Ho in favor of Ha |
| 0.01 < P < 0.05 | Moderate evidence against Ho in favor of Ha |
| 0.001 < P < 0.01 | Strong evidence against Ho in favor of Ha |
| P < 0.001 | Very strong evidence against Ho in favor of Ha |
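The table can be written as a small lookup, which is handy when reporting many p-values. This is a sketch only: the function name and short labels are chosen here, and because the table uses strict inequalities, the treatment of values exactly on a boundary (e.g., p = 0.05 counted as "weak") is an assumption.

```python
def evidence_against_h0(p):
    """Map a p-value to the qualitative label in the table above."""
    if not 0 <= p <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        return "very strong"
    if p < 0.01:
        return "strong"
    if p < 0.05:
        return "moderate"
    if p < 0.10:
        return "weak"
    return "none"

for p in (0.3, 0.07, 0.03, 0.004, 0.0005):
    print(p, evidence_against_h0(p))
```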
Lecture 4: Statistical hypothesis testing
Fisher:
p-value as an informal measure of discrepancy between data and Ho
“If p is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 …”
Lecture 4: Statistical hypothesis testing
General procedure for H testing:
- Specify Null (Ho) and alternate (Ha)
- Determine test (and test statistic) to be used
- Test statistic is used to compare your data to expectation under Ho (null hypothesis)
- Specify the significance level (α): the p-value below which Ho will be rejected
Lecture 4: Statistical hypothesis testing
General procedure for H testing:
- Collect data
- Perform test
- If p-value < α, conclude Ho is likely false and reject it
- If p-value ≥ α, conclude there is insufficient evidence that Ho is false and retain it
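The final decision step reduces to a single comparison once α has been fixed in advance. This is a minimal sketch; the function name `decide` and the returned strings are illustrative, and the p-values fed in are the two examples from earlier in the lecture.

```python
def decide(p_value, alpha=0.05):
    """Return the decision about Ho for a given p-value and alpha."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be between 0 and 1")
    return "reject Ho" if p_value < alpha else "retain Ho"

ALPHA = 0.05  # chosen before looking at the data
for p in (0.3, 0.03):
    print(p, decide(p, ALPHA))
```

Note that the same p-value can lead to a different decision under a stricter α, which is why α is set before the data are collected.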